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LIMITED-INFORMATION GOODNESS-OF-FIT 
TESTING OF DIAGNOSTIC CLASSIFICATION 
ITEM RESPONSE THEORY MODELS 


Mark Hansen, Li Cai, Scott Monroe, and Zhen Li 
CRESST/University of California, Los Angeles 


Abstract 

It is a well-known problem in testing the fit of models to multinomial data that the 
full underlying contingency table will inevitably be sparse for tests of reasonable 
length and for realistic sample sizes. Under such conditions, full-information test 
statistics such as Pearson’s X 2 and the likelihood ratio statistic G 2 are poorly 
calibrated. Limited-information fit statistics, such as the M 2 statistic proposed by 
Maydeu-Olivares & Joe (2006), have been suggested as possible alternatives to 
full-information tests in various modeling including item response theory models. 
In this study, we considered the application of M 2 to the goodness-of-fit testing of 
diagnostic classification models (e.g., Rupp, Templin, & Henson, 2010). Through 
a series of simulation studies, we found that M 2 is well calibrated across a range 
of diagnostic model structures. The sensitivity of the test statistic to detect various 
types of model misspecilication was also examined. M 2 was found to be sensitive 
to certain misspecifications of the item model (e.g., fitting disjunctive models to 
data generated according to a conjunctive model), errors in the Q-matrix, and 
local item dependence due to unmodeled testlet effects. On the other hand, M 2 
was largely insensitive to misspecifications in the distribution of higher-order 
latent dimensions and to the specification of an extraneous attribute. To 
complement the analyses of overall model goodness-of-fit, we investigated the 
potential utility of the Chen and Thissen (1997) local dependence statistic X 2 D for 
characterizing sources of misfit. The X 2 D statistic was found to be slightly 
conservative (with Type I error rates consistently below the nominal level) but 
still useful in identifying potential misspecifications. Patterns of local dependence 
arising due to model misspecifications are illustrated. Finally, we used the M 2 and 
X 2 d statistics to evaluate a diagnostic model lit to a data from the Trends in 
Mathematics and Science Study (TIMSS), drawing upon analyses previously 
conducted by Lee, Park, and Taylan (2011). 



1 Introduction 

Diagnostic classification models (see, e.g., Rupp, Templin, & Henson, 2010) have 
received increasing attention within the field of educational and psychological measurement. 
The popularity of these models may be largely due to their perceived ability to provide useful 
information concerning both examinees (classifying them according to their attribute 
profiles) and test items (describing the particular attributes that are relevant to or required in 
order to achieve a certain response). Despite these attractive features, it is important to note 
the potential for biased interpretations when such models are misspecified. Various authors 
have noted that methods for evaluating the fit of diagnostic models remain relatively 
underdeveloped (e.g., Maris & Bechger, 2009; Wilhelm & Robitzsch, 2009; Rupp et ah, 
2010). That said, there has been notable progress (e.g., Sinharay & Almond, 2009; Lai, Cui, 
& Gierl, 2012; de la Torre, 2008; Rupp & Templin, 2008; Kunina-Habenicht, Rupp, & 
Wilhelm, 2012). 

In this study, we examine the utility of limited information goodness-of-fit statistics 
(e.g., Bartholomew & Leung, 2002) for evaluating the fit of diagnostic classification models. 
Specifically, we apply the M 2 statistic proposed by Maydeu-Olivares & Joe (2006) to a range 
of diagnostic model structures and consider its calibration and sensitivity across a range of 
misspecifications. Various M 2 -type statistics have been applied to an increasing assortment 
of item response theory models (e.g., Cai, Maydeu-Olivares, Coffman, & Thissen, 2006; Joe 
& Maydeu-Olivares, 2010; Cai & Hansen, 2013; Cai & Monroe, 2014). Limited-information 
fit statistics have been suggested as a possible approach for evaluating the use of diagnostic 
models (e.g., Rupp et ah, 2010), and initial applications appear quite promising (Templin, 
2007; Jurich, Bradshaw, & DeMars, 2014). Here, we seek to continue to develop this line of 
work. We begin by describing the development of the statistic within the context of 
diagnostic modeling, then evaluate its perfonnance through a series of simulation studies. We 
then use the statistic to assess the fit of a diagnostic model to real data. Finally, we discuss 
some of the limitations and opportunities to further develop this work. 

2 General Diagnostic Classification Modeling Framework 

2.1 An Item Response Model For Diagnostic Classification 

In this section, we describe a higher-order, hierarchical diagnostic classification model 
(Cai, 2013a) to which we will apply the limited-information statistic. This model may be best 
understood as an extension of existing diagnostic modeling frameworks, such as the log- 
linear cognitive diagnosis model (LCDM; Henson, Templin, & Willse, 2009), which we use 
here as a starting point. Let there be a total of i — 1 , ... , / items. In this research we limit the 
scope to dichotomously scored items, noting that extending the theory to multiple-categorical 



data is straightforward. Let the response categories be coded as c — 0 (incorrect/no 
endorsement) or c = 1 (correct/endorsement). Let there be K dichotomous (0 — 1) underlying 
latent attribute variables, x — (x lf ... , x k , , x K )', where x k = 1 indicates mastery/possession 
of an attribute. The relationship between items and attributes is captured by the Q-matrix, 
which is IxK matrix of zeros and ones. The (i, k) th entry in the Q-matrix is denoted as q ik , 
and takes on a value of one if item i measures attribute k. 


Let 7)(l|x) be the category 1 response function for item i: 


Ti(c\x ) = 


1 

1 + expC-^)' 


( 1 ) 


where the linear predictor is 

Vi = cti +A' i h i (Q,x). (2) 

It follows that for category 0, the response function is 

T i (0\x) = 1.0-T i (l\x). (3) 

Then’s and A’s in the linear predictor are the item parameters, and /q(Q, x) is a 
potentially vector-valued function that defines how the measured attributes combine to create 
the linear predictor portion of the item response model in Equation (2). As noted by Rupp et 
al. (2010) and Choi, Rupp, & Pan (2013), placing certain constraints on the A’s yields several 
of the more commonly utilized diagnostic models. For instance, suppose that according to the 
Q-matrix item i measures attributes 1 and 2, and that successful solution of the item requires 
both attributes. The linear predictor may take the following deterministic-input noisy “and” 
gate (DINA; Ju nk er & Sijtsma, 2001) form with a single free parameter for the interaction 
term: 


T]i — a t + 0*! + 0x 2 + A i x 1 x 2 , (4) 

in which case /q(Q, x) = (x 1 ,x 2 ,x 1 x 2 )' contains the main effects and the second-order 
interaction, and = (0, 0, A t )' contains fixed parameters. Or perhaps the item only requires 
the mastery of either attribute 1 or attribute 2. Then the linear predictor of the IRT model 
may take the following detenninistic-input noisy “or” gate (DINO; Templin & Henson, 
2006) form with a single slope parameter set equal between the main effects and the 
interaction term in absolute magnitude, albeit the interaction tenn is of the opposite direction: 

Vi = at + AjXi + 1;X 2 - A[X 1 X 2 , (5) 

in order to reflect the specification that mastery of both attributes does not lead to a 
further increase in the logit, and in this case = (A t , A L , —X l )' contains linear restrictions. 



Yet another possibility is that each attribute contributes to some increase in the logit, and that 
the magnitude of the increase due to one attribute does not depend on the mastery of the 
other. In that case, the linear predictor might contain only the main effect terms, and the 
model would take the form of von Davier’s (2005) general diagnostic model or the 
compensatory reparameterized unified model (C-RUM; Hartz, 2002): 

Vt = a i + hxi + Atx 2 . (6) 

Let Yi be a random variable whose realization y t is a response to item i. Regardless of 
the exact form of the model or the number of attributes, the probability mass function of Y u 
conditional on x, is that of a Bernoulli: 

P(Y t = Yi\x ) = [T i (y i \x)y<[1.0 - 7^ I*)] 1 "* (7) 

2.2 Hierarchical and Higher-Order Extensions 

Conditional independence is a critical assumption for model building in all of IRT 
analysis. In the case of DCMs, it is customary to assume the independence of item responses 
given the attributes (e.g., Templin & Henson, 2006). That is, the conditional response pattern 
probability factors: 

n(y\x) = p(^Y t =y t 

'i=i 

where y — (y 1 , ... , y f )' is an 7x1 vector that contains the observed response pattern. 
However, this is unrealistic if item i happens to belong to a cluster of items dependent on the 
same stimulus in a passage-based reading assessment, or if it falls into a specific content 
subdomain, e.g., social aspects of quality of life, along with other items. A standard strategy 
in item factor analysis is to include additional random effects to account for potential residual 
dependence due to the common source of variation shared by a set of items. Let there be 
s = 1, ...,S such clusters of items. Furthennore, if we assume that the clusters are mutually 
exclusive, the response function for item i in category 1 becomes 

1 

TKikfs) - 1 + exp[ _ (aic + W Q, *) + (9) 

where P s is the item slope on specific factor/dimension <f s . Again the response function for 
category 0 becomes 

T i (0\x,Z s ) = 1.0-T i (l\x,5 s y (10) 

The model resembles a bifactor model or testlet model (Gibbons & Hedeker, 1992). An 
item is pennitted to load on at most one specific dimension. In this case, conditional 




independence may be more amenable: 

n(y\x, f lf ... ' &) = p ( P| Y i = Vi x - £l- - ' & 

Vi=i 

where § s . is a notational shorthand for the set of items that load on specific dimension 
s, md P(X t = yilx^g) = [T l (y l \x,Z s )] yt [T l (y l \x,l; s )] 1 - yi is again a Bernoulli probability 
mass function. 

Suppose the distribution of then’s are given by \x)g(jz 2 1*) d(^s\ x )^ i.e., the 

specific dimensions are conditionally independent given x. We may integrate all S specific 
dimensions out without a full S-dimensional integral. This is because we may utilize the 
familiar dimension reduction method (see Cai, Yang, & Hansen, 2011; Rijmen, 2009) 
developed for item bifactor analysis to transform the following S-fold integral 

n(y\x) = J n(y\x, I f s )g(f 1 \x)-g(f s \x)df 1 -df s , ( 12 ) 


s=l ie$ s 


into a series of one-dimensional integrations 

n(y\x) = inn P(Xi = yi\x,^ s ) g(^\x) - g^s\x)d^ ■■■d^ s 

s = 1 ie§ s 
S 

= n/] - \ p ( y i =yi\ x ’Zs)gtfs\x)dZ s , 

s=l ie§ s 


(13) 


which will vastly reduce the amount of time needed for maximum marginal likelihood 
based parameter estimation because these integrals must be numerically evaluated. 

As per de la Torre & Douglas (2004), we may further model the latent attribute profiles 
for individuals by regressing the x’s onp higher-order dimensions 0 . For example, we may 
use a multidimensional extension of the 2-parameter logistic model (Reckase, 2009) to relate 
the latent attributes to the latent dimensions: 


P O/c = 1|0) = n k (Q) = 


1 + exp [~(a k0 + a fcl^l + f a kpPp)l 


Again, if we assume conditional independence of the latent attributes given 6, we may 

write 


K 

n(x\0) = ]^[7r k (0)]* fc [l - 7 r fc (0)] 1_Xfe . 

k= 1 


( 15 ) 



When we combine n(y\x) from Equation (11) with n(x\6) from Equation (14), we see 
that the contribution to marginal likelihood from response pattern y can be obtained as: 


= 


n(y\x)n(x\G)dx 


g{0) do, 


(16) 


where the integral in the brackets is actually a 2 ^- 16 ^ summation over the (conditional) 
attribute profile probabilities for all x. 

3 Limited-information Goodness-of-fit Testing 

3.1 Maximum Marginal Likelihood Estimation 

Let y be a dxl vector that collects together all free parameters in the model. These 
include parameters from all I items (the a’s and A’s), parameters for the distribution of the 
specific dimensions g(J; s I*), the higher-order IRT model (the a’s), and the parameters from 
the distribution of the higher-order dimensions g(0). To emphasize the fact that the marginal 
likelihood in Equation (16) is a function of the parameters once the response pattern is 
observed (and considered fixed), let n y (y ) denote the marginal likelihood. 

For / items, the IRT model generates a total of C — 2 1 cross-classifications or possible 
item response patterns in the fonn of a contingency table. Based on a sample of N 
respondents, let the observed proportion associated with pattern 7 be denoted asp y . The 
sampling model for this contingency table is a multinomial distribution with C cells and N 
trials. The multinomial log-likelihood for the item parameters 7 is proportional to 

log L ( 7 ) oc N ^ p y log 7 T y ( 7 ), (1 7) 

y 

where the summation is over all C response patterns. Maximization of the log-likelihood 
(e.g., with the EM algorithm; Bock & Aitkin, 1981) leads to the maximum marginal 
likelihood estimator 7 . 

Upon finding 7 , the model generates model-implied probabilities for each response 
pattern ft y — n y (y) . Suppose the model-implied response pattern probabilities ft y are 

collected into a C x 1 vector 7T of all model-implied response pattern probabilities. By 
analogy, let a C x 1 vector 7T* contain the true (population) response pattern probabilities. 
Similarly, the observed proportions p u can be collected into a Cxi vector p. For example, 
for 3 items there are 2 3 = 8 item response patterns, and the response pattern probabilities 



and observed proportions are: 


7T* = 


/7*000\ 
/ ^001 » 
7*010 
7*011 
7*100 
7*101 

\ **110 j 
'^1 ll ' 


71 = 


/^ooo\ 

f 7fi)01 


/7*000(y)\ 
7*00 i(y) 


/Pooo\ 
1 Pool 1 

7*010 


7*010 (y) 


Poio 

7*011 


7ron (y) 


Poll 

7*100 


7*100 (y) 

- v - 

PlOO 

7*101 


7*101 (y) 


PlOl 

1 7*110 j 

\7Ti ii / 


, 7*110 (y) j 

\7*m(y)/ 


1 PllO j 

Vll J 


( 18 ) 


Using this setup, exactly correct model specification (i.e., the model fits perfectly) in 
the population can be understood as the statement that there exists y* such that n(y*) — 71 * 
The values y* may be taken as the true parameters to be estimated. 

Under correct model specification, from results in discrete multivariate analysis (e.g., 
Bishop, Fienberg, & Holland, 1975), the maximum likelihood estimator is consistent, 
asymptotically nonnal and asymptotically efficient, which can be summarized as follows: 

VN(?-yO * ^(O.F- 1 ), (19) 

where T — A'* [diag(7r*)] _1 A* is the dxd Fisher information matrix, with the 
Jacobian matrix A* defined as the Cxd matrix of all first-order partial derivatives of the 
response pattern probabilities with respect to the parameters, evaluated at y*: 

. dn(y*) 

A* = - , . 

dy' 

3.2 Distribution of Residuals under Maximum Likelihood Estimation 

It can be shown that the asymptotic distribution of (p — 7T*) is C-variate normal: 

VN(p — 7T*) -> 7Vc(0, S), (20) 


where S = diag(n *) — 7r*7r' is the multinomial covariance matrix. The cell residual vector 
(p-ft) is asymptotically C-variate normal under maximum likelihood estimation: 

VN(p~n)^M c (0, T), (21) 

where T = S — A*7 ? “ 1 A'*. 

The model also generates implied marginal probabilities. Consider the 3-item example 
from above. There are 3 first order marginal probabilities n t (i = 1,2,3), one per item. There 
are also 3 second order marginal probabilities ii^' for the unique item pairs ( 1 < i' < i < 
3). In general, these probabilities correspond to the / sets of univariate and /(/ — l)/2 sets 



of bivariate margins that can be obtained from the full C-dimensional contingency table using 
a reduction operator matrix (see e.g., Maydeu-Olivares & Joe, 2006). An example is given 
below: 


T*2 



0 0 
0 1 
1 0 
0 0 
0 0 
0 0 


0 1 
1 0 
1 0 
0 0 
0 0 
1 0 


1 1 
0 1 
1 0 
0 1 
1 0 
0 0 




( 22 ) 


where L is a fixed operator matrix of Os and Is that reduces the response pattern 
probabilities and proportions into marginal probabilities and proportions up to order 2. The 
vector n 2 — Llr = L7r(y) = 7T 2 (y ) contains all first and second order model-implied 
marginal probabilities, evaluated at the maximum likelihood estimate. Obviously p 2 = Lp is 
the vector of first and second order observed marginal proportions. 

A requirement on L is that it has full row rank. It implies that the marginal residual 
vector (p 2 -Ti 2 ) — L(p-ir) is a full ra nk linear transformation of the multinomial cell residual 
vector (p — n) . Therefore, the marginal residual vector (p 2 — ft 2 ) is also asymptotically 
nonnal: 

Viv(p 2 -7f 2 ) = VivL(p-7r)^Jv;(o,r 2 ), (23) 

and r 2 = LIT = LSL' - LA*:F _1 A*L' = S 2 - A 2 *5 7 “ 1 A 2 * , where S 2 = LEL' , and 
A 2 * = LA* is the Jacobian of the marginal probabilities: 

_ d7r(y*) _ dL7r(y*) _ d?r 2 (y*) 

2 * dy' dy' dy' 

The dimensionality r of the multivariate nonnal random vector is equal to the rank of 
L . The rank of A 2 * detennines local identification of the model from the marginal 
probabilities. If A 2 * has full column rank, the model is locally identified. For dichotomously 
scored item responses, r = 7 + /(/ — l)/2. 

3.3 A Proposed Test Statistic 

The full-information test statistics such as likelihood ratio G 2 and Pearson’s A 2 use 
residuals based on the full response pattern cross-classifications to test the fit of the model 
against the general multinomial alternative. The comparison between fr u and p u (on 
logarithmic or linear scales) leads to well-known goodness of fit statistics such as the 
likelihood ratio G 2 and Pearson’s X 2 : 



( 24 ) 


G 2 = 2N\ v u \og^ , X 2 = N V - Pu - — . 

4 — 1 

ll U 

Under the null hypothesis that the IRT model fits exactly, these two statistics have the 
same asymptotic reference distribution, which is a central chi-square with degrees-of- 
freedom equal to C-l-d (Bishop et ah, 1975). 

Unfortunately as the number of items increases, the number of response patterns 
increases exponentially. For more than a dozen or so dichotomous items, the contingency 
table upon which the multinomial is defined becomes sparse for most realistic N. It is well 
known that the asymptotic chi-square approximations for the full-infonnation test statistics 
break down under sparseness (see e.g., Bartholomew & Tzamourani, 1999), and the utility of 
the full-information overall goodness of fit indices for routine IRT applications is 
questionable at best. 

Recently, limited-information overall fit statistics such as Maydeu-Olivares and Joe’s 
(2006) M 2 have been developed. Limited-information fit statistics use residuals based on 
lower order (e.g., first and second order) margins of the contingency table. These lower order 
margins are far better filled when compared to the sparse full contingency table. There is 
growing awareness that limited-infonnation tests can maintain correct size and can be more 
powerful than full-information tests (Cai, Maydeu-Olivares, Coffman, & Thissen, 2006; Joe 
& Maydeu-Olivares, 2010; Cai & Hansen, 2013). Moreover, Templin (2007) and Jurich, 
Bradshaw, & DeMars (2014) have demonstrated the potential usefulness of limited- 
information statistics in evaluating the fit of diagnostic classification models, in particular. 
We are not aware, however, of a comprehensive study of the theoretical and empirical 
aspects of the limited-information fit testing approach for diagnostic classification models, 
over a wide range of model misspecifications, under reasonably realistic conditions, and 
implementing both the overall test and diagnostic indices in a publicly available software 
distribution. Let E = diag(n ) — nn' denote the multinomial covariance matrix evaluated at 
7 , and let E 2 = LEL'. Also evaluate the marginal Jacobian at 7 , 

^ dn 2 ( 7 ) 
dy' 

When A 2 has full column rank, the statistic 

M 2 = N (p 2 - 7 T 2 )'fi (p 2 - tt 2 ), (25) 

where n = S 2 -S 2 A 2 ( A 2 E 2 A 2 1 A 2 S 2 , is asymptotically chi-square distributed with r-d 
degrees-of-freedom under the null hypothesis that the model fits exactly in the population 



(Browne, 1984). To see this is the case, we first recognize that from Equation (14), 

VN(p 2 -tt 2 ) is asymptotically a normal random vector with zero means and covariance matrix 

E 2 -A 2 *F _1 A 2 *. By the continuous mapping theorem and the consistency of the maximum 

likelihood estimator, fi converges in probability to the limiting weight matrix 1 i m D. = Q, 

_1 

- 1-1 / ' _1 \ ' -1 

where fl = S 2 -E 2 A 2 * ( A 2 *S 2 A 2 * J A 2 *S 2 . The product 

(s 2 -A 2 *F -1 A 2 *j fl = I r -S 2 1 A 2 * ^A 2 *S 2 1 A 2 *j A 2 * is idempotent and its ra nk is equal to r-d. 

Therefore, by Cochran’s theorem and Slutsky’s theorem, M 2 is asymptotically chi-squared 
with r-d degrees-of-freedom. 

When the model does not fit exactly in the population, the quadratic form in M 2 
provides a mechanism for computing a Root Mean Square Error of Approximation (RMSEA; 
Browne & Cudeck, 1993; Maydeu-Olivares, 2013) type index to characterize the degree of 
model error. This is because the limiting mean of Vn(p 2 -ti 2 ) will no longer be zero when 

there does not exist ay* such thatTr(y*) = rr*. Let F = (p 2 -it 2 )^(P2 -tt 2 ) be the observed 
noncentrality. As per Browne and Cudeck (1993), an unbiased estimate of the population 
noncentrality is F* = F-df/N. The sample RMSEA based on M 2 is defined as a measure of 
the per degree-of- freedom noncentrality (truncated at 0): 


RMSEA - max 



(26) 


Confidence intervals of RMSEA may be easily computed from the noncentral chi- 
square distribution by following established procedures in Browne and Cudeck (1993). 

4 Simulation Studies 

In order to evaluate the performance of M 2 in the context of diagnostic classification 
modeling, we conducted a series of simulation studies. First, we evaluated the calibration of 
the test statistic when the fitted model was correctly specified (i.e., matched the generating 
model; the null condition). We then examined the power of M 2 to detect a variety of model 
misspecifications. 

For all simulation conditions, the test length was 24 items, and there were four latent 
attributes. A Q-matrix was obtained by randomly assigning each item to load onto exactly 
two of the four attributes. This random assignment was balanced, such that all two-attribute 
combinations were represented equally within the test. A higher-order DINA model was used 
in all data generation. Item parameters were randomly drawn from distributions of specified 



values that are typical of those found in empirical analyses of educational assessment data. 

Rather than sampling slopes and intercepts directly, distributions were instead obtained 
by specifying distributions of correct response probabilities for respondents that either lack or 
possess the requisite underlying attributes (i.e., “guessing” and “slipping,” respectively). 
These values were then transformed to match the LCDM parameterization (Henson et al., 
2009). Guessing parameters were drawn from a beta distribution with a mean of 0.2 and 
standard deviation of 0.05. The sampling distribution for the slipping parameters had a mean 
of 0.10 and standard deviation of 0.05. Item intercepts (oq) were computed from the guessing 
parameters (g,) in the following manner: 



This intercept, together with the slipping parameter (s,), was then used to obtain the 
slope parameter: 

In addition to the latent attribute variables, items were also influenced in some data 
generating conditions by six group-specific dimensions (i.e., testlet effects). The slopes of the 
items on these dimensions were equal within a data generating condition, /? = (0, 1, 2). 

Various higher-order structures were used to generate the distribution of the latent 
attributes (the x variables). Models with a single higher-order dimension utilized a one- 
parameter logistic (1PL) IRT model, with slope a = 1.5 for all items and intercepts of 
c x = -0.45, c 2 = 0.15, c 3 = -0.15, c 4 = 0.45. Scores for the higher-order dimension were 
sampled from either a standard normal density or from one of four non-normal densities 
illustrated in Figure 1. The non-normal distributions were parameterized as Davidian curves 
(Woods and Lin, 2009), each with mean 0 and variance 1. The four densities can be 
described as Bimodal (d 1 = -0.10, d 2 = 1.98), Extreme Bimodal {d 1 = -0.52, d 2 = 2.29), 
Right-skewed (9 X = 0.69, d 2 = 4.09), and Extreme Right-skewed (9 X = .79, d 2 — 6.58), 
where d ± and d 2 are the skewness and kurtosis coefficients, respectively. These densities are 
similar to those used previously in research on non-normal latent trait density estimation in 
IRT (see, e.g., Woods & Lin, 2009). 

Insert Figure 1 about here 

In addition to the univariate nonnal and non-nonnal higher-order distributions, we also 
generated data from a model in which the higher-order structure was multidimensional. For 
these conditions, attributes 1 and 2 loaded onto one dimension (0!), and attributes 3 and 4 
loaded onto a second dimension (0 2 ). The means of the two higher-order dimensions were 0, 



their variances were 1, and the correlation between domains varied across conditions 
(p = 0.4, 0.6, 0.8). 

The Q-matrix and the item parameters used in data generation are presented in Table 1. 
Figure 2 presents path diagrams for the three basic model structures: a higher-order DINA 
model (top panel), a DINA model with correlated higher-order dimensions (middle panel), 
and a higher-order DINA with testlet effects (bottom panel). For each data generating 
condition, 500 datasets were generated in three sample sizes (N = 500, 1000,2000). The 
fitted models used with each data generating condition are summarized in Table 2. The first 
rows of Table 2 describe the null conditions, and the remaining sections identify the various 
misspecified models fit within each generating condition. All data generation was conducted 
using the R software (R Development Core Team, 2008). Model estimation and goodness-of- 
fit computations were conducted with the flexMIRT® item response modeling software, 
version 2 (Cai, 2013b). 

Insert Tables 1 and 2 about here 
Insert Figure 2 about here 

4.1 Calibration of the Test Statistic 

Results for M 2 under the null conditions are shown in Table 3. Across the models 
evaluated — and for each sample size considered — the mean, variance, and empirical rejection 
rates obtained for the statistic across replications are close to what would be expected. Two- 
tailed Kohnogorov-Smirnov tests were used to evaluate the extent to which the observed 
distribution of M 2 differed from the expected chi-square reference distribution. At the 
a = 0.05 level, the Kohnogorov-Smirnov test was not significant under any of the null 
conditions. The quantile-quantile plots in Figure 3 also show good match between the 
observed and expected distributions of the test statistic. 

Insert Table 3 about here 
Insert Figure 3 about here 

Figure 4 shows a histogram of the empirical rejection rates for the Chen and Thissen 
(1997) Xl D statistic, across the 276 item pairs. These rejection rates were obtained by tallying 
the number of times Xl D exceeded the critical value for a = 0.05 and a chi-squared 
distribution with one degree of freedom. It seems that the rejection rate is generally below the 
nominal level, averaging between 0.025 and 0.028 for the null conditions. This result is 
consistent with results obtained with the statistic in item response theory models (Chen & 
Thissen, 1997; Liu & Maydeu-Olivares, 2013), where Xl D has been found to be somewhat 
conservative for dichotomous data. Although this may reduce the sensitivity of the index for 



making judgments of the statistical significance of local dependence for a single item pair, in 
practice we often use the index to evaluate the relative severity of local independence 
violations across the many item pairs, so conservativeness may in fact be a welcoming 
feature under multiplicity of testing. 

Insert Figure 4 about here 

4.2 Power to Detect Unmodeled Testlets 

The previous section demonstrated that M 2 had very good Type I error control when a 
hierarchical model was fit to data generated with testlets. Here, we present results obtained 
when a standard higher-order diagnostic model — that is, a model that ignores the testlet 
structure of the data generating model — is fit to the same data. As shown in Table 4, this 
model was rejected for every replication and at all a levels, regardless of the magnitude of 
the testlet effect or the sample size. It seems, then, that M 2 is quite sensitive to this type of 
model misspecification. 

Insert Table 4 about here 

Figure 5 provides heat maps depicting the average Xl D value (across replications) for 
each item pair. Darker values indicate larger values of the statistic. The patterns of local 
dependence revealed through Xl D are consistent with the known structure of the generating 
model. The fitted model does not adequately explain the strong covariation among items 
within each testlet, resulting in rather severe local dependence among these items. 

Insert Figure 5 about here 

4.3 Power to Detect Misspecifications of the Higher-order Structure 

Results presented in this section were obtained by fitting a standard higher-order DINA 
model (i.e., one in which the higher-order dimension is univariate normal) to data generated 
from models with non-standard higher-order structures. The data generating models included 
higher-order latent dimension distributions that were non-normal or bivariate normal. Table 5 
presents the M 2 empirical rejection rates for these conditions of model misspecification. 

Insert Table 5 about here 

Empirical rejection rates for the non-normal generating conditions are quite similar to 
the nominal a levels, indicating that M 2 is not sensitive to this type of misspecification. This 
result is consistent with a previous study examining the power of M 2 to detect non-normality 
in item response theory models (Li & Cai, 2012). 

The results for the data generated from models with two higher-order dimensions 
varied according to the correlation between these dimensions. Specifically, rejection rates 



increased as the correlation decreased (that is, as the higher-order structure of the generating 
model became less similar to the undimensional structure of the fitted model). When the 
correlation was p = 0.8, the rejection rates were only slightly elevated above the nominal a 
levels. Power was higher when p = 0.4, though still fairly low except for the largest sample 
size examined. 

4.4 Power to Detect Misspecifications of the Q-matrix or the Item Model Type 

Table 6 presents results for conditions in which there were errors in the Q-matrix of the 
fitted model or misspecification of the item model type. As described earlier, the Q-matrix 
errors included the addition or omission of paths (i.e., changing the values of individual Q- 
matrix elements, from 0 to 1 or vice versa) and the addition or omission of latent attributes 
(i.e., adding or deleting a column from the Q-matrix). For errors of model type, a 
compensatory (C-RUM) model was specified for either one item or for all items (the data 
were generated according to a DINA model for all items). 

Insert Table 6 about here 

The M 2 statistic was quite sensitive to three of the four Q-matrix specifications 
examined. There was good power to detect the addition or omission of paths, as well as to 
detect the omission of a latent attribute. In contrast, specification of extraneous attribute 
appeared to have no effect on the rejection rates, which were similar to the nominal a levels. 
Figure 6 shows the patterns of local dependence for these conditions, as reflected in the 
average Xl D values. The Xl D statistic clearly identified items with incorrect Q-matrix 
misspecifications, so long as the number of such misspecifications is small (the pattern of 
local dependence resulting from the omission of an attribute — which affects a much larger 
number of variables — would seem to be less interpretable). 

Insert Figure 6 about here 

When an extraneous attribute is included in the model, neither M 2 nor Xl D provide 
evidence of misspecification. However, inspection of the marginal attribute probabilities 
reveals that the expected probability of possessing the extraneous attribute is very close to 1 . 
In a DINA model, then, it seems that such a variable can be absorbed without any change in 
model fit. In the example here, the extraneous attribute (x 5 ) nearly always takes a value of 1. 
In those cases, the third-order interaction is the same as the corresponding second order 
interaction (that is, if x 5 = 1, thenXjXj X 5 = XjXA. As a result, the parameters estimated end 
up representing the same quantity (Aj Xj x5 ~ Aj x A 

The M 2 statistic appears to have some sensitivity to the misspecification of item type. 
However, when the error is limited to a single item, rejection rates are only slightly above the 



nominal rates. Power is substantially higher when this misspecification is applied to all items 
in the test. For the conditions examined, Xl D does not appear to be particularly infonnative, 
as shown in Figure 7. 

Insert Figure 7 about here 

5 Analysis of Empirical Data 

The results of the simulation study presented in the previous section suggest that fit 
statistics based on univariate and bivariate marginal subtables can be useful in identifying 
and characterizing model misfit. Here, we apply the proposed approach to an empirical 
example, using M 2 and the Chen and Thissen (1997) Xl D statistic to evaluate the fit of 
alternative diagnostic models to data from the 2007 Trends in Mathematics and Science 
Study (TIMSS). This example builds on prior work by Lee, Park, & Taylan (2011), who 
analyzed data from booklets 4 and 5 from the TIMSS 2007 fourth grade mathematics test. As 
part of their study, several teachers and content experts reviewed and coded the TIMSS test 
items according to the specific testing objectives described in the TIMSS 2007 framework. 
For the 25 items considered in the study, 15 unique testing objectives were identified (out of 
the 32 total objectives in the test framework). Accordingly, the Q-matrix proposed by Lee et 
al. (2011) consists of 25 rows (one for each item) and 15 columns (one for each attribute). 

There is a good deal of variation in the number of items measuring each attribute. Ten 
of the 15 objectives are measured by only two or three items, while the second and third 
attributes are measured by 16 and 1 1 items, respectively. Among the 25 items, the number of 
underlying attributes ranges from 1 (items 2, 9, 24) to 6 (item 14). 

Lee et al. (2011) specified a conjunctive (DINA) model for the items in their analysis. 
For the current study, we initially fit a higher-order version of this model (de la Torre & 
Douglas, 2004) using the Q-matrix exactly as reported in the earlier study to a sample of 564 
students from the United States. As shown in the first row of Table 7, the value of M 2 for this 
model was 391.0, with 259 degrees of freedom. The RMSEA based on M 2 has a value of 
0.030, with 90% confidence interval of (0.000, 0.036). 

Insert Table 7 about here 

Values of the Chen and Thissen (1997) Xl D statistic are presented for this model in the 
left panel of Figure 8. There are a handful of item pairs with fairly large Xl D values, 
indicating substantial local dependence. For illustrative purposes, we focused our attention 
here on the two item pairs with the largest Xl D values: items 1 — 5 (Xl D = 13.8) and items 
18 — 19 (Xl D = 38.0). Table 8 presents the observed marginal response proportions and 
model-implied probabilities from which the test statistic was computed. 



Insert Figure 8 about here 
Insert Table 8 about here 

We examined the two item pairs and their specification within the initial fitted model 
for potential causes of the observed local dependence. The items considered in this study 
have all been released (Foy & Olson, 2009) and were thus available for review. Items 18 and 
19 are administered as within a cluster or testlet (M031242A/B/C), and it was evident upon 
inspecting the items that a correct response to item 18 (M031242A) would seem to greatly 
simplify the task of answering item 19; the answer to the question 19 (M031242C) can quite 
simply be read from a table that the examinee is asked to complete for item 18. This may 
explain why the diagnostic model does not fully explain the covariation in responses between 
these two items. Although it would be possible to arrive at the correct answer to item 19 by 
applying the skills identified in the Q-matrix as being relevant, those skills are less necessary 
once an examinee answers item 18. In order to model this lack of independence between 
these items (conditional on the attributes), a testlet effect could be specified. It should be 
noted that item 20 (M031242C) is also part of the same item cluster. However, this item does 
not rely quite as directly on infonnation from items 18 or 19, so we chose to ignore local 
dependence between item 20 and items 18 and 19. Similarly, items 9 and 10 (M041258A and 
M041258B, respectively) are administered as a testlet but show very little evidence of local 
dependence, so no random effect is specified for this pair. 

Our review of items 1 and 5 identified two possible changes in the Q-matrix. First, 
although item 1 (M041052) had been described in the study by Lee et al. (2011) as requiring 
both attributes 1 and 2, it seemed to us that possession of attribute 1 would be sufficient to 
obtain the correct answer. Thus, we posited an alternative Q-matrix specification for item 1, 
with this item depending only on possession of attribute 1. After examination of item 5, we 
concluded that the specification of dependence on attributes 2 and 3 was reasonable. 
However, the relevance of attribute 8 was unclear. Thus, a second Q-matrix change was 
considered, with item 5 depending on attributes 2 and 3 but not attribute 8. 

In summary, our alternative model for the TIMSS data included three changes. Two 
changes were made to the Q-matrix: the values of elements q 12 and q 5 8 were changed from 
1 to 0. In addition, a testlet effect was added to account for the strong dependence between 
items 18 and 19. The total number of estimated parameters is 67 for the alternative model — 
one more than required for the initial model. The Q-matrix changes for items 1 and 5 do not 
affect the number of free parameters. For item 1, the interaction y 11x2 is fixed to 0, but the 
main effect is now estimated. Similarly, for item 5, the third-order interaction 75 , 2 x 3 x 8 is 
fixed, and 75 , 2 x 3 ^ estimated. The one additional parameter estimated in the alternative model 



is the slope parameter for the testlet effect (for identification, the slopes of items 18 and 19 
were constrained to be equal — i.e., (3 181 = (3 191 = (3). 

Overall goodncss-of-fit indices for the alternative model are presented in the second 
row of Table 7. The value ofM 2 for the alternative model was 330.3, with 258 degrees of 
freedom. The RMSEA based on M 2 has a value of 0.022, with 90% confidence interval of 
(0.000, 0.029). Results for the Chen & Thissen (1997) Xl D are shown in the second row of 
Table 8 and in the right panel of Figure 8. The fit of the alternative model is slightly 
improved over the initial model, both on the basis of the limited-information fit statistics and 
the infonnation-based indices (AIC and BIC). It should be noted, however, that this 
improvement is rather modest. In addition, while specification of the testlet effect seems to 
have fully accounted for the local dependence of items 18 and 19 (withXL D decreased from 
38.0 to 0.9), there remains some dependence between items 1 and 5 (Xl D only decreased 
from 13.8 to 1 1.0), despite the changes in model specification. 

Of course, for this empirical study, we cannot know the true generating model. That 
said, our analyses are primarily intended to illustrate the ways in which researchers might use 
the goodness-of-fit statistics to evaluate the fit of models, to characterize possible 
misspecifications, and to identify candidate alternative models (and then test these 
alternatives, though ideally this would be done with an independent sample of respondents). 

6 Discussion 

In this paper, we demonstrate the application of limited-infonnation fit statistics to 
diagnostic classification models. Through simulation studies we found that M 2 is well 
calibrated, closely matching its reference distribution. This result was observed across a wide 
range of conditions, including standard higher-order diagnostic classification models, 
hierarchical models (e.g., with testlet effects), and models with correlated higher-order 
dimensions. 

We also examined the sensitivity of M 2 to a number of different kinds of model 
misspecification, including (1) failure to account for testlet-type effects, (2) incorrect 
specification of higher-order distribution or structure (fitting higher-order models with 
univariate normal distributions of the higher-order dimension to data generated from non- 
normal or bivariate normal distributions, (3) errors in the Q-matrix, and (4) misspecifications 
of item type (C-RUM instead of DINA). The results here were mixed. M 2 was found to be 
highly sensitive to the presence of testlet effects and certain Q-matrix misspecifications 
(addition or omission of paths, omission of an attribute). In contrast, M 2 was largely 
insensitive to misspecifications of the higher-order structure. Misfit was not detected when 
the higher-order dimension was sampled from any of the non-normal distributions examined. 



When the generating model included correlated higher-order dimensions, power was 
relatively low for strongly correlated dimensions but increased as the correlation decreased. 

There are of course, many limitations to the work presented here. First, we have 
focused on the application of M 2 to dichotomous item response data. However, it would be 
useful to examine the performance of the test statistic with polytomous models. Prior work 
has suggested that even bivariate marginal subtables may be poorly filled as the number of 
categories increases, potentially reducing the utility of M 2 (Cai & Hansen, 2013). Various 
methods for collapsing the subtables have been proposed, and seem to perfonn well in 
applications of item response theory (Joe & Maydeu-Olivares, 2010; Maydeu-Olivares, Cai, 
& Hernandez, 2011; Cai & Hansen, 2013; Cai & Monroe, 2014). However, it would be 
useful to evaluate these approaches in the context of diagnostic classification modeling. 

A second limitation is that in our simulation study we held constant certain aspects of 
the data-generating model. For example, the number of attributes (4), the number of test 
items (24), the structure of the Q-matrix, and the item type (DINA) were the same across all 
conditions. Although this was done in order to put some bounds on the scope of this study, 
one might wonder whether the findings might differ if any of these features were allowed to 
vary. 

It is noteworthy that the test statistics utilized here — M 2 and the Chen and Thissen 
(1997) Xl D — have recently been implemented for diagnostic classifications models in 
commercially item response modeling software (Cai, 2013b). We expect that increased 
availability of these tools for evaluating fit will further clarify their potential utility, as well 
as their limitations. 
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8 Tables 


Table 1 


Data Generation for Simulation Study: Q-Matrix and Item Parameters 




Q -matrix 


a i 


Attribute Two-Way Interaction Terms 



Testlet Slope Parameters 


i 

<7i,i 

Qi,2 

9(,3 

tfi,4 

10,1X2 

10,1X3 10,1X4 10,2X3 10,2X4 

10,3X4 

hi 

h 2 

A, 3 

Pi, 4 

Pi, 5 

h 6 

i 

0 

1 

1 

0 

-1.29 


3.76 


(0,1,2) 






2 

1 

1 

0 

0 

-1.04 

3.00 



(0,1,2) 






3 

0 

1 

1 

0 

-1.36 


2.90 


(0,1,2) 






4 

1 

0 

1 

0 

-1.00 


2.06 


(0,1,2) 






5 

1 

0 

1 

0 

-1.36 


3.14 



(0,1,2) 





6 

0 

1 

0 

1 

-.95 


2.51 



(0,1,2) 





7 

0 

0 

1 

1 

-.87 



2.63 


(0,1,2) 





8 

0 

0 

1 

1 

-1.19 



3.61 


(0,1,2) 





9 

0 

1 

1 

0 

-.59 


2.52 




(0,1,2) 




10 

0 

1 

0 

1 

-.98 


2.90 




(0,1,2) 




11 

0 

1 

0 

1 

-1.29 


3.05 




(0,1,2) 




12 

0 

1 

0 

1 

-.74 


1.94 




(0,1,2) 




13 

1 

0 

1 

0 

-1.11 


2.92 





(0,1,2) 



14 

1 

0 

0 

1 

-1.10 


2.95 





(0,1,2) 



15 

1 

1 

0 

0 

-.80 

2.21 






(0,1,2) 



16 

1 

0 

0 

1 

-.84 


2.32 





(0,1,2) 



17 

1 

0 

0 

1 

-.92 


2.97 






(0,1,2) 


18 

1 

0 

0 

1 

-.81 


2.87 






(0,1,2) 


19 

1 

0 

1 

0 

-.85 


2.42 






(0,1,2) 


20 

1 

1 

0 

0 

-1.08 

2.45 







(0,1,2) 


21 

0 

0 

1 

1 

-1.81 



3.60 






(0,1,2) 

22 

1 

1 

0 

0 

-.91 

2.23 








(0,1,2) 

23 

0 

0 

1 

1 

-1.12 



2.67 






(0,1,2) 

24 

0 

1 

1 

0 

-1.15 


3.17 







(0,1,2) 


Note: In generating model (DINA), the attribute main effects (which are not shown) are fixed to zero. 



Table 2 


Summary of data generating and fitted models used in simulation study 



Data Generating Model 


Fitted Model 


Higher-Order Structure 

Attribute Model Testlets 

Higher-Order Structure 

Attribute Model 

Testlet 



null (correctly specified) models (results 

in Table 3) 



univariate normal 


DINA 

no 

univariate normal 

DINA 

no 

univariate normal 


DINA 

yes (P = 1) 

univariate normal 

DINA 

yes 

univariate normal 


DINA 

yes 0? = 2) 

univariate normal 

DINA 

yes 

bivariate normal (p 

= 0.4) 

DINA 

no 

bivariate normal 

DINA 

no 

bivariate normal (p 

= 0.6) 

DINA 

no 

bivariate normal 

DINA 

no 

bivariate normal (p 

= 0.8) 

DINA 

no 

bivariate normal 

DINA 

no 



failure to model testlet effects (results in Table 4) 



univariate normal 


DINA 

yes (p = 1) 

univariate normal 

DINA 

no 

univariate normal 


DINA 

yes (P = 2) 

univariate normal 

DINA 

no 



misspecifications of higher-order latent variable distributions (results in Table 5) 



univariate bimodal 


DINA 

no 

univariate normal 

DINA 

no 

univariate extreme bimodal 

DINA 

no 

univariate normal 

DINA 

no 

univariate right-skewed 

DINA 

no 

univariate normal 

DINA 

no 

univariate extreme right skewed 

DINA 

no 

univariate normal 

DINA 

no 

bivariate normal (p 

= 0.4) 

DINA 

no 

univariate normal 

DINA 

no 

bivariate normal (p 

= 0.6) 

DINA 

no 

univariate normal 

DINA 

no 

bivariate normal (p 

= 0.8) 

DINA 

no 

univariate normal 

DINA 

no 



Q-matrix 

or item type misspecifications (results in Table 6) 



univariate normal 


DINA 

no 

univariate normal 

C-RUM (item 8) 

no 

univariate normal 


DINA 

no 

univariate normal 

C-RUM (all items) 

no 

univariate normal 


DINA 

no 

univariate normal 

omit path 

no 

univariate normal 


DINA 

no 

univariate normal 

add path 

no 

univariate normal 


DINA 

no 

univariate normal 

omit attribute 

no 

univariate normal 


DINA 

no 

univariate normal 

add attribute 

no 









Table 3 

Simulation Study Results: M 2 Calibration under Null Conditions for Various Data Generating Models 


N 

reps 

df 

M 

V 

KS 


Empirical Rejection Rate 



.200 

.150 

.100 

.050 

.010 





higher-order DINA 






500 

500 

247 

248.4 

518.4 

.356 

.206 

.162 

.136 

.064 

.014 

1000 

500 

247 

248.0 

508.7 

.757 

.212 

.154 

.104 

.060 

.018 

2000 

500 

247 

247.0 

524.3 

.692 

.212 

.146 

.108 

.054 

.014 



higher-order DINA with testlet effects, testlet slope /? = 1 




500 

500 

241 

240.7 

450.9 

.931 

.200 

.140 

.078 

.040 

.002 

1000 

500 

241 

240.0 

436.5 

.787 

.186 

.128 

.076 

.036 

.006 

2000 

500 

241 

240.4 

499.5 

.198 

.198 

.156 

.116 

.050 

.004 



higher-order DINA with testlet effects, testlet slope /? = 2 




500 

499 

241 

242.2 

459.7 

.311 

.212 

.166 

.120 

.062 

.014 

1000 

500 

241 

242.6 

450.0 

.049 

.204 

.142 

.086 

.056 

.008 

2000 

500 

241 

240.3 

435.2 

.352 

.186 

.134 

.090 

.042 

.004 



DINA with bivariate normal higher-order latent distribution, p = 

0.4 



500 

496 

245 

245.0 

480.8 

.913 

.179 

.139 

.101 

.050 

.008 

1000 

499 

245 

246.3 

503.0 

.146 

.220 

.160 

.106 

.062 

.010 

2000 

500 

245 

243.3 

488.5 

.079 

.186 

.140 

.078 

.040 

.016 



DINA with bivariate normal higher-order latent distribution, p — 

0.6 



500 

483 

245 

245.5 

486.6 

.551 

.199 

.147 

.099 

.050 

.012 

1000 

498 

245 

246.6 

494.8 

.057 

.219 

.173 

.100 

.056 

.008 

2000 

500 

245 

244.0 

480.6 

.557 

.172 

.122 

.088 

.046 

.012 



DINA with bivariate normal higher-order latent distribution, p = 

0.8 



500 

394 

245 

244.2 

473.6 

.953 

.198 

.140 

.091 

.038 

.003 

1000 

464 

245 

244.2 

480.4 

.805 

.196 

.157 

.093 

.039 

.002 

2000 

492 

245 

244.1 

481.9 

.390 

.195 

.146 

.093 

.051 

.008 


Note: N is the sample size; reps is the number of converged replications (out of 500); df is the degrees of 
freedom for M 2 , given the fitted model; M and V observed the mean and variance, respectively, of M 2 across 
the converged replications within the condition; KS indicates the />-values for two-tailed Kolmogorov-Smirnov 
test. 



Table 4 

Simulation Study Results: M 2 Power to Detect Testlet Effects 


N 

reps 

df 

Empirical Rejection Rate 



RMSEA 

.200 

.150 

.100 

.050 

.010 

M 

(90% Cl) 



higher-order DINA with testlet effects, testlet slope /? = 1 



500 

500 

247 

1.000 

1.000 

1.000 

1.000 

1.000 

.044 

(.036, .051) 

1000 

500 

247 

1.000 

1.000 

1.000 

1.000 

1.000 

.044 

(.040, .048) 

2000 

500 

247 

1.000 

1.000 

1.000 

1.000 

1.000 

.044 

(.041, .047) 



higher-order DINA with testlet effects, testlet slope /? = 2 



500 

500 

247 

1.000 

1.000 

1.000 

1.000 

1.000 

.114 

(.107, .120) 

1000 

500 

247 

1.000 

1.000 

1.000 

1.000 

1.000 

.115 

(.110, .120) 

2000 

500 

247 

1.000 

1.000 

1.000 

1.000 

1.000 

.115 

(.111, .118) 


Note: N is the sample size; reps is the number of converged replications (out of 500); dfis the degrees of 
freedom for M 2 , given the fitted model. 



Table 5 

Simulation Study Results: M 2 Power to Detect Misspecifications of Higher-Order Latent Variable Distributions 


N 

reps 

df 


Empirical Rejection Rate 



RMSEA 

.200 

.150 

.100 

.050 

.010 

M 

(90% Cl) 



DINA with bimodal higher-order latent variable distribution 



500 

500 

247 

.184 

.134 

.090 

.044 

.006 

.006 

(.000, .017) 

1000 

500 

247 

.180 

.130 

.090 

.038 

.006 

.004 

(.000, .012) 

2000 

500 

247 

.216 

.168 

.100 

.056 

.010 

.003 

(.000, .009) 



DINA with extreme bimodal higher-order latent variable distribution 


500 

500 

247 

.220 

.156 

.096 

.058 

.012 

.006 

(.000, .018) 

1000 

500 

247 

.206 

.160 

.120 

.048 

.014 

.004 

(.000, .012) 

2000 

500 

247 

.220 

.176 

.116 

.052 

.008 

.003 

(.000, .009) 



DINA with right-skewed higher-order latent variable distribution 


500 

500 

247 

.188 

.128 

.086 

.028 

.006 

.005 

(.000, .017) 

1000 

500 

247 

.204 

.164 

.108 

.050 

.018 

.004 

(.000, .012) 

2000 

500 

247 

.216 

.160 

.110 

.038 

.008 

.003 

(.000, .009) 



DINA with extreme right-skewed higher-order latent variable distribution 


500 

500 

247 

.196 

.142 

.102 

.060 

.010 

.006 

(.000, .018) 

1000 

500 

247 

.242 

.176 

.124 

.062 

.008 

.004 

(.000, .013) 

2000 

500 

247 

.190 

.130 

.086 

.040 

.004 

.002 

(.000, .008) 



DINA with bivariate normal higher-order latent distribution, p = 

0.4 


500 

500 

247 

.290 

.224 

.158 

.088 

.022 

.007 

(.000, .019) 

1000 

500 

247 

.452 

.364 

.290 

.164 

.056 

.007 

(.000, .015) 

2000 

500 

247 

.626 

.536 

.450 

.312 

.134 

.007 

(.000, .012) 



DINA with bivariate normal higher-order latent distribution, p = 

0.6 


500 

500 

247 

.246 

.190 

.124 

.064 

.022 

.006 

(.000, .018) 

1000 

500 

247 

.330 

.252 

.174 

.092 

.026 

.006 

(.000, .014) 

2000 

500 

247 

.416 

.332 

.246 

.150 

.036 

.005 

(.000, .010) 



DINA with bivariate normal higher-order latent distribution, p = 

0.8 


500 

500 

247 

.236 

.170 

.100 

.050 

.014 

.006 

(.000, .017) 

1000 

500 

247 

.240 

.192 

.114 

.056 

.004 

.004 

(.000, .012) 

2000 

500 

247 

.238 

.184 

.130 

.066 

.014 

.003 

(.000, .009) 


Note: N is the sample size; reps is the number of converged replications (out of 500); df is the degrees of 
freedom for M 2 , given the fitted model. 



Table 6 


Simulation Study Results: M 2 Power to Detect to Q-Matrix or Item Type Misspecifications 


N 

reps 

df 


Empirical Rejection Rate 



RMSEA 

.200 

.150 

.100 

.050 

.010 

M 

(90% Cl) 

omit paths from attribute to items (x-l -» y 5 , 

*1 -> Yie) 






500 

500 

247 

.978 

.968 

.940 

.888 

.730 

.024 

(.015, .032) 

1000 

500 

247 

1.000 

1.000 

1.000 

1.000 

.998 

.024 

(.019, .029) 

2000 

500 

247 

1.000 

1.000 

1.000 

1.000 

1.000 

.025 

(.021, .028) 

add (extraneous) paths from attribute to items (x x -» y 3 , x x -» 

Y 23 ) 




500 

500 

247 

.920 

.892 

.850 

.772 

.566 

.021 

(.010, .029) 

1000 

500 

247 

1.000 

1.000 

1.000 

1.000 

.984 

.022 

(.017, .027) 

2000 

500 

247 

1.000 

1.000 

1.000 

1.000 

1.000 

.022 

(.019, .025) 

omit attribute (x 4 ) 









500 

500 

248 

.472 

.392 

.306 

.206 

.064 

.010 

(.000, .021) 

1000 

500 

248 

.738 

.680 

.590 

.452 

.202 

.011 

(.000, .019) 

2000 

500 

248 

.974 

.956 

.940 

.888 

.740 

.012 

(.007, .016) 

add (extraneous) attribute (x 5 ) 








500 

500 

246 

.204 

.166 

.138 

.058 

.012 

.006 

(.000, .018) 

1000 

500 

246 

.206 

.154 

.108 

.056 

.018 

.004 

(.000, .012) 

2000 

499 

246 

.206 

.148 

.100 

.054 

.012 

.003 

(.000, .009) 

apply an 

incorrect item type (C-RUM for item 8) 






500 

500 

246 

.230 

.180 

.138 

.066 

.018 

.006 

(.000, .018) 

1000 

500 

246 

.246 

.186 

.128 

.074 

.018 

.004 

(.000, .013) 

2000 

500 

246 

.280 

.208 

.144 

.076 

.024 

.003 

(.000, .009) 

apply an 

incorrect item type (C-RUM for all items) 






500 

500 

223 

.966 

.952 

.932 

.866 

.710 

.025 

(.014, .034) 

1000 

500 

223 

1.000 

1.000 

1.000 

1.000 

.990 

.025 

(.020, .031) 

2000 

500 

223 

1.000 

1.000 

1.000 

1.000 

1.000 

.026 

(.022, .029) 


Note: N is the sample size; reps is the number of converged replications (out of 500); df is the degrees of 


freedom for M 2 , given the fitted model. 



Table 7 

Empirical Illustration: Overall Goodness-of-Fit Evaluation for the Two Fitted Models. 



df 

m 2 

P 

RMSEA 

(90% C.I.) 

AIC 

BIC 

model 1 

259 

391.0 

<.001 

.030 

(.000, .036) 

15872.4 

16158.5 

model 2 

258 

330.3 

.002 

.022 

(.000, .029) 

15821.3 

16111.8 


Table 8 

Empirical Illustration: Bivariate Marginal Response Pattern Observed Proportions and Model-Implied 
Probabilities for 2 Item Pairs (1 — 5 and 18 — 19) and Corresponding Xj^ Values for the Two Fitted Models 

Observed Model-Implied 

; ; ; ; — p 

Poo Pol PlO Pll tToO tToi tTlO t?ll 


items 1,5 


model 1 

.128 .073 .220 .580 

.097 

.103 

.252 

.549 

13.8 

<.001 

model 2 


.103 

.101 

.250 

.546 

11.0 

<.001 


items 18,29 

model 1 


.257 

.104 

.291 

.348 

38.0 

<.001 

model 2 

.314 .046 .232 .408 

.310 

.053 

.241 

.396 

0.9 

.331 




9 Figure Captions 


Figure 1. Densities from which scores on the higher-order dimension were sampled for 
non-normal data generating conditions. 

Figure 2. Path diagrams for data generating models used in simulation study. 

Figure 3. Simulation study results: Quantile-quantile plots of observed M 2 values and their 
reference chi-square distributions (degrees of freedom shown in the subscripts of 
the x-axis labels). Closed grey circles indicate results for conditions with sample 
size N = 500, open black circles indicate results for N= 1000, and plus signs 
indicate N= 2000. Reported p-values are for a two-tailed Kohnogorov-Smirnov 
test of the equality of the observed M 2 distributions with its corresponding 
reference distribution. 

Figure 4. Simulation study results: Empirical rejection rates for Chen and Thissen (1997) 
Xl D across all item pairs. Results are shown for null conditions (fitted model 
correctly specified) with sample size N = 2000 and a = 0.05. Rejection rates are 
based on all converged replications (minimum of 492; maximum of 500). 

Figure 5. Simulation study results: Chen and Thissen (1997) X^ D heat map for diagnostic 
models with testlet effects. Shading is based on the average RMSEA for the item 
pair across all converged replications. The generating model in both panels is a 
higher-order DINA model with testlet effects. The testlet slope parameters are 
/? = 1 and /? = 2 for the left and right panels, respectively. The fitted models do 
not include the testlet effects. 

Figure 6. Simulation study results: Chen and Thissen (1997) Xl D heat map for diagnostic 
models with incorrect specification of item type. Shading is based on the average 
RMSEA for the item pair across all converged replications. The generating model 
is a higher-order DINA model. In left panel, item 8 (indicated by tick mark on the 
x- and y-axes) is incorrectly specified as C-RUM. In right panel, all 24 items are 
incorrectly specified as C-RUM. 

Figure 7. Simulation study results: Chen and Thissen (1997) Xl D heat map for diagnostic 
models with Q-matrix misspecification. Shading is based on the average RMSEA 
for the item pair across all converged replications. The generating model in all 
four panels is a higher-order DINA model. In top-left panel, paths from x t to y 5 
and from x v to y 16 are omitted (i.e., the values of Q-matrix elements q 51 and 
916,1 were changed from 1 to 0; tickmarks identify items 5 and 16). In the top- 
right panel, extraneous paths were created from x x to y 3 and from x ± to y 23 (i.e., 
the values of Q-matrix elements q 31 and q 231 were changed from 0 to 1; 
tickmarks identify items 3 and 23). In the bottom-left panel, attribute x 4 is omitted 
(i.e., all values in column 4 of the Q-matrix are 0; 12 elements that have values of 
1 in the Q-matrix of the generating model are identified by tickmarks). In the 
bottom-right panel, an extraneous attribute, x 5 is specified (i.e., a 5th column is 
added to the Q-matrix); 4 items (2, 6, 8, and 13); indicated by tickmarks on the x- 
and y-axes) have Q-matrix values of 1 for this attribute. 



Figure 8. Results from empirical illustration: Chen and Thissen (1997) Xl D heat map for 
analysis of 25 items from TIMSS 4 th grade math booklet 4. Left panel shows 
results obtained by fitting a higher-order DINA model with Q-matrix as described 
by Lee, Park, & Taylan (2011). Right panel shows results obtained by fitting a 
model with a slightly altered Q-matrix and the addition of a testlet effect (for 
items 18 and 19). 
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